106 PART 3 Getting Down and Dirty with Data
Dealing with more than two levels in a category
When a categorical variable has more than two levels (like the Type of Caregiver or
Likert agreement scale examples we describe in the earlier section “Looking at Levels
of Measurement”), data storage gets even more interesting. First, you have to
ask yourself, “Is this variable a Choose only one or Choose all that apply variable?”
The coding is completely different for these two kinds of multiple-choice
variables.
You handle the Choose only one situation just as we describe for Type of Caregiver in
the preceding section: you establish a numeric code for each alternative. For the
Likert scale example, if the item asked about patient satisfaction, you could have a
categorical variable called PatSat, with five possible values: 1 for strongly disagree,
2 for somewhat disagree, 3 for neither agree nor disagree, 4 for somewhat agree,
and 5 for strongly agree. And for the Type of Caregiver example, if only one kind of
caregiver is allowed to be chosen from the three choices of nurse, physician, or
social worker, you can have a categorical variable called CaregiverType with three
possible values: 1 for nurse, 2 for physician, and 3 for social worker. Depending
upon the study, you may also choose to add a 4 for other, and a 9 for unknown
(9, 99, and 999 are codes conventionally reserved for unknown). If you find
unexpected values, it is important to research and document what these mean to
help future analysts encountering the same data.
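
To make the Choose only one coding concrete, here is a minimal Python sketch. It assumes the CaregiverType codes described above (including the optional 4 for other and 9 for unknown); the data values and the helper name find_unexpected are made up for illustration. It labels each stored code and flags any value that is not in the code book so that you can research and document it.

```python
# Hypothetical code book for CaregiverType, following the codes in the text.
CAREGIVER_CODES = {
    1: "nurse",
    2: "physician",
    3: "social worker",
    4: "other",
    9: "unknown",
}


def find_unexpected(values, codebook):
    """Return the sorted distinct values that have no entry in the code book."""
    return sorted({v for v in values if v not in codebook})


# Made-up column of stored codes; the 7 is deliberately not a valid code.
caregiver_type = [1, 2, 3, 9, 2, 7, 4]

unexpected = find_unexpected(caregiver_type, CAREGIVER_CODES)
labels = [CAREGIVER_CODES.get(v, "UNEXPECTED") for v in caregiver_type]

print("Unexpected codes:", unexpected)
print("Labels:", labels)
```

Keeping the code book in one place means the same mapping can be reused for data entry checks and for labeling output later.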
But the situation is quite different if the variable is Choose all that apply. For the
Type of Caregiver example, if the patient is being served by a team of caregivers,
you have to set up your database differently. Define separate variables in the database
(separate columns in Excel), one for each possible category value. Imagine
that you have three variables called Nurse, Physician, and SW (the SW stands for
social worker). Each variable is a two-value category, also known as a two-state
flag, and is populated as 1 for having the attribute and 0 for not having the attribute.
So, if participant 101’s care team includes only a physician, participant 102’s
care team includes a nurse and a physician, and participant 103’s care team
includes a social worker and a physician, the information can be coded as shown
in the following table.
Subject   Nurse   Physician   SW
101       0       1           0
102       1       1           0
103       0       1           1
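
The Choose all that apply coding shown above can be sketched in Python as follows. This is only an illustration: the care_teams dictionary and the to_flags helper are hypothetical names, and the data are the three participants from the example. Each participant's set of caregivers is turned into one 0/1 flag per category.

```python
# One flag variable per possible category, as in the example above.
CATEGORIES = ["Nurse", "Physician", "SW"]

# Hypothetical raw data: each participant's care team as a set of categories.
care_teams = {
    101: {"Physician"},
    102: {"Nurse", "Physician"},
    103: {"SW", "Physician"},
}


def to_flags(team, categories):
    """Code each category as 1 if it is on the team and 0 if it is not."""
    return {c: 1 if c in team else 0 for c in categories}


rows = {subject: to_flags(team, CATEGORIES) for subject, team in care_teams.items()}

# Print one row per participant, with one column per flag variable.
print("Subject", *CATEGORIES)
for subject, flags in rows.items():
    print(subject, *(flags[c] for c in CATEGORIES))
```

Because every category gets its own column, a participant can have any combination of flags set, which a single numeric code cannot represent.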
If you have variables with more than two categories, missing values theoretically
can be indicated by leaving the cell blank, but blanks are difficult to analyze in
statistical software. Instead, categories should be set up for missing values so they
can be part of the coding system (such as using a numerical code to indicate